3 research outputs found

    Availability modeling and evaluation on high performance cluster computing systems

    Cluster computing has attracted growing attention from both industry and academia for its enormous computing power, cost effectiveness, and scalability. The Beowulf-type cluster, for example, is a typical High Performance Computing (HPC) cluster system. Availability, as a key attribute of the system, needs to be considered at the system design stage and monitored at mission time. Moreover, system monitoring is a must to help identify defects and ensure that the system's availability requirement is met. In this study, novel solutions that provide availability modeling, model evaluation, and data analysis as a single framework have been investigated; these three are the key components of the investigation. The general availability concepts and modeling techniques are briefly reviewed. The system's availability model is divided into submodels based on their functionalities. Furthermore, an object-oriented Markov model specification to facilitate availability modeling and runtime configuration has been developed. Numerical solutions for Markov models are examined, with emphasis on the uniformization method. Alternative implementations of the method are discussed, in particular the cost of an alternative solution for small state-space models and different ways of solving large sparse Markov models. The dissertation also presents a monitoring and data analysis framework, which is responsible for failure analysis and availability reconfiguration. In addition, event logs provided by the Lawrence Livermore National Laboratory have been studied and applied to validate the proposed techniques.
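The uniformization method named in this abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the two-state up/down model, the rates, and the names `uniformization`, `tol`, and `max_terms` are all assumptions chosen for illustration. The method rewrites the transient solution of a continuous-time Markov chain with generator Q as a Poisson-weighted sum over powers of the stochastic matrix P = I + Q/Λ, where Λ is at least the largest exit rate.

```python
import math


def uniformization(Q, pi0, t, tol=1e-12, max_terms=10_000):
    """Transient state probabilities pi(t) of a CTMC with generator Q.

    Hypothetical helper for illustration only. Q is a dense list of rows,
    pi0 the initial distribution, t the time point. Dense storage only
    suits small models; large sparse models need a sparse representation.
    """
    n = len(Q)
    rate = max(-Q[i][i] for i in range(n))  # uniformization rate Lambda
    if rate == 0.0 or t == 0.0:
        return list(pi0)
    # Discretized chain P = I + Q / Lambda (a stochastic matrix).
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    # Poisson weight for k = 0; exp() underflows when rate * t is very
    # large, which this simple sketch does not guard against.
    weight = math.exp(-rate * t)
    v = list(pi0)                      # holds pi0 * P^k
    result = [weight * x for x in v]
    accumulated = weight               # sum of Poisson weights so far
    k = 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        accumulated += weight
        for j in range(n):
            result[j] += weight * v[j]
    return result


if __name__ == "__main__":
    lam, mu = 0.01, 1.0                # assumed failure and repair rates
    Q = [[-lam, lam], [mu, -mu]]       # state 0 = up, state 1 = down
    print(uniformization(Q, [1.0, 0.0], t=2.0)[0])
```

For the two-state model the result can be checked against the closed form A(t) = μ/(λ+μ) + λ/(λ+μ)·e^(−(λ+μ)t), which is what makes this model a convenient test case.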

    Dependability Prediction of High Availability OSCAR Cluster Server

    High availability (HA) computing has recently gained much attention, especially in enterprise and mission-critical systems. HA is now a necessity rather than a luxury feature. Thus, together with the open source community, we are in the process of adding HA features to Open Source Cluster Application Resources (OSCAR), a widely adopted Linux PC cluster system. Server redundancy is the initial key aspect of the next-generation HA OSCAR cluster system. In this paper, we introduce an HA server for the OSCAR cluster system. Its architecture and mechanism are discussed, and we then model and predict the dependability of the system with a Petri net-based model, the Stochastic Reward Net (SRN). The reliability and instantaneous availability of the system are presented as results.
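To give a rough sense of why server redundancy matters for availability, the following back-of-the-envelope sketch compares a single server with a redundant pair. It is not the paper's SRN model: it assumes exponential failure and repair times, independent servers, and perfect failover, and the function names and rates are hypothetical.

```python
def steady_state_availability(failure_rate, repair_rate):
    """Long-run fraction of time a single repairable server is up."""
    return repair_rate / (failure_rate + repair_rate)


def redundant_pair_availability(failure_rate, repair_rate):
    """Availability of two servers where one working server suffices.

    Assumes independent failures and repairs and instantaneous, perfect
    failover, which a detailed model such as an SRN would relax.
    """
    single = steady_state_availability(failure_rate, repair_rate)
    return 1.0 - (1.0 - single) ** 2


if __name__ == "__main__":
    lam, mu = 1.0, 99.0  # assumed rates per unit time, illustration only
    print(steady_state_availability(lam, mu))
    print(redundant_pair_availability(lam, mu))
```

Under these assumptions the pair is unavailable only when both servers are down at once, so the unavailability is squared. Imperfect failover detection and shared failure causes would reduce that gain.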